Module 02
R will ignore any text after # for that line.
```{r}
i_use_snake_case
otherPeopleUseCamelCase
some.people.use.periods
And_aFew.People_RENOUNCEconvention
```

Complete the exercises in section 2.5 of R for Data Science.
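dplyr basics

dplyr functions have the following characteristics:

- The first argument is always a data frame.
- The subsequent arguments typically describe which columns to operate on, using the variable names (without quotes).
- The output is always a new data frame.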
Because the input and output are both data frames, you can string together multiple dplyr functions with the pipe operator |>.
Note
Previous versions of dplyr used the %>% operator, which is imported from the magrittr package. The native |> operator, built into R since version 4.1, is now the preferred operator for piping with dplyr.
dplyr Functions for Transforming Rows (1/2)

- filter()
- arrange()
- distinct()

dplyr Functions for Transforming Rows (2/2)

- filter(): Extracts rows that meet a logical condition.
- arrange(): Reorders rows.
- distinct(): Finds all unique rows, usually based on a subset of columns.

filter()

Filter the penguins data frame to include only observations where the species is “Adelie”.
We place the species name in quotes because it is a character string.
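A minimal sketch of the filter described above, assuming the penguins data frame comes from the palmerpenguins package and that dplyr is installed. The object name `adelie` is only illustrative.

```{r}
library(dplyr)
library(palmerpenguins)  # assumed source of the penguins data frame

# Keep only the rows where species equals the string "Adelie"
adelie <- penguins |>
  filter(species == "Adelie")

adelie
```

Note the double equals sign: `==` tests for equality, while a single `=` would be read as an argument name.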
Note
Factors are used for categorical variables, variables that have a fixed and known set of possible values. They are also useful when you want to display character vectors in a non-alphabetical order.
filter() to find outliers

Use the body_mass_g variable of the penguins data.

Figure: boxplot of body_mass_g for the penguins data.

Based on the boxplot, only the Chinstrap species has outliers.
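The boxplot itself is not reproduced here. One way to draw a comparable plot is sketched below, assuming ggplot2 and palmerpenguins are installed. The axis labels and the object name `body_mass_boxplot` are illustrative choices, not taken from the original figure.

```{r}
library(ggplot2)
library(palmerpenguins)  # assumed source of the penguins data frame

# One box per species; na.rm = TRUE silently drops rows with missing body mass
body_mass_boxplot <- ggplot(penguins, aes(x = species, y = body_mass_g)) +
  geom_boxplot(na.rm = TRUE) +
  labs(x = "Species", y = "Body mass (g)")

body_mass_boxplot
```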
Outliers are observations that lie at least 1.5 times the IQR above the third quartile or below the first quartile, where \(IQR = Q_3 - Q_1\). An observation \(x\) is an outlier if

\[ x \ge Q_3 + 1.5 \times IQR \quad \text{or} \quad x \le Q_1 - 1.5 \times IQR \]
```{r}
# Create a new data frame of the Chinstrap species
chinstrap <- penguins |> filter(species == "Chinstrap")

# Find the IQR, Q1 and Q3 values using stats functions
iqr_chinstrap <- IQR(chinstrap$body_mass_g, na.rm = TRUE)
q1_chinstrap <- quantile(chinstrap$body_mass_g, 0.25, na.rm = TRUE)[[1]]
q3_chinstrap <- quantile(chinstrap$body_mass_g, 0.75, na.rm = TRUE)[[1]]
```

Notice that the body_mass_g column is accessed as chinstrap$body_mass_g inside the IQR() and quantile() functions. The $ operator is a common way to access a variable in a data frame. Notice also the na.rm = TRUE argument, which removes missing values before the calculation.
These functions are not part of the dplyr package; they belong to the stats package, which is loaded when R starts.
I’m also using the sprintf() function to format the output. The %.2f format specifier tells R to print a value as a floating point number with two decimal places. Check out the ?sprintf help file for more information on formatting output. We don’t often use this function with Markdown, but functions of this type are common in other languages like C and Python.
Outliers in the penguins data frame are observations where the species is “Chinstrap” and the body_mass_g is greater than or equal to \(Q_3 + 1.5 \times IQR\) or less than or equal to \(Q_1 - 1.5 \times IQR\).
That is, a body mass \(\ge\) 4.64 kg or \(\le\) 2.79 kg.
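The outlier rule can be sketched end to end as follows, assuming the penguins data frame comes from the palmerpenguins package. The names `upper_bound`, `lower_bound`, and `chinstrap_outliers` are illustrative, and the quartiles are recomputed so the chunk runs on its own.

```{r}
library(dplyr)
library(palmerpenguins)  # assumed source of the penguins data frame

chinstrap <- penguins |> filter(species == "Chinstrap")

# Quartiles and IQR of Chinstrap body mass (grams)
q1 <- quantile(chinstrap$body_mass_g, 0.25, na.rm = TRUE)[[1]]
q3 <- quantile(chinstrap$body_mass_g, 0.75, na.rm = TRUE)[[1]]
iqr <- q3 - q1

# Outlier bounds from the rule Q3 + 1.5 * IQR and Q1 - 1.5 * IQR
upper_bound <- q3 + 1.5 * iqr
lower_bound <- q1 - 1.5 * iqr

# Report the bounds in kilograms with two decimal places
sprintf("Upper bound: %.2f kg; lower bound: %.2f kg",
        upper_bound / 1000, lower_bound / 1000)

# Keep only the observations at or beyond either bound
chinstrap_outliers <- chinstrap |>
  filter(body_mass_g >= upper_bound | body_mass_g <= lower_bound)

chinstrap_outliers
```

The `|` operator means "or", so a row is kept when it satisfies either condition.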
arrange() function to reorder rows

distinct() function (1/2)

distinct() function (2/2)

Answer question 4 in section 3.2.5 Exercises from R for Data Science (2e).
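As a hedged illustration of arrange() and distinct() on the penguins data (not a solution to the exercise), the sketch below assumes palmerpenguins and dplyr are installed. The object names are illustrative.

```{r}
library(dplyr)
library(palmerpenguins)  # assumed source of the penguins data frame

# arrange(): reorder rows, heaviest penguins first; rows with missing mass go last
heaviest_first <- penguins |>
  arrange(desc(body_mass_g))

# distinct(): one row per unique species-island combination
species_islands <- penguins |>
  distinct(species, island)

species_islands
```

By default distinct() keeps only the columns it is given. Adding `.keep_all = TRUE` would retain the other columns from the first matching row.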
dplyr Functions for Transforming Columns (1/4)

mutate()

Mutate may be the most powerful tool in the tidyverse.
We can use mutate() to create new columns or modify existing columns.
mutate() to create a new column (1/3)

Simple mathematical transformation of an existing column: convert body_mass_g to kilograms.
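A minimal sketch of this conversion, assuming the penguins data frame from the palmerpenguins package. The new column name `body_mass_kg` is an illustrative choice.

```{r}
library(dplyr)
library(palmerpenguins)  # assumed source of the penguins data frame

# Add a column holding body mass in kilograms (1 kg = 1000 g)
penguins_kg <- penguins |>
  mutate(body_mass_kg = body_mass_g / 1000)

penguins_kg |> select(species, body_mass_g, body_mass_kg)
```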
mutate() to create a new column (2/3)

We can use mutate() to combine existing columns into new columns.
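One possible example of combining columns is sketched below. The ratio of bill length to bill depth and the column name `bill_ratio` are hypothetical choices made for illustration. The penguins data frame is assumed to come from palmerpenguins.

```{r}
library(dplyr)
library(palmerpenguins)  # assumed source of the penguins data frame

# Combine two existing columns into a new one
penguins_ratio <- penguins |>
  mutate(bill_ratio = bill_length_mm / bill_depth_mm)

penguins_ratio |> select(species, bill_length_mm, bill_depth_mm, bill_ratio)
```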
mutate() to create a new column (3/3)

We can apply logical conditions to create new columns.
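A sketch of a conditional column using dplyr's if_else(). The 4000 g threshold and the labels "large" and "small" are arbitrary values chosen for illustration, and penguins is assumed to come from palmerpenguins.

```{r}
library(dplyr)
library(palmerpenguins)  # assumed source of the penguins data frame

# Label each penguin by a logical condition on body mass;
# rows with a missing body mass get a missing label
penguins_sized <- penguins |>
  mutate(size_class = if_else(body_mass_g >= 4000, "large", "small"))

penguins_sized |> count(size_class)
```

For more than two categories, case_when() follows the same idea with one condition per label.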
Keep mutate() in mind when you ask the question:

Do I need to change something in my data frame?
select() to keep or drop columns (1/4)

Probably the most straightforward function in dplyr.
Simply list the columns you want to keep, or…
select() to keep or drop columns (2/4)

…list the columns you want to drop.
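Both uses can be sketched as follows, assuming the penguins data frame from palmerpenguins. The choice of columns is illustrative.

```{r}
library(dplyr)
library(palmerpenguins)  # assumed source of the penguins data frame

# Keep: list the columns you want
kept <- penguins |>
  select(species, island, body_mass_g)

# Drop: prefix the columns you do not want with a minus sign
dropped <- penguins |>
  select(-sex, -year)

names(kept)
names(dropped)
```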
select() to keep or drop columns (3/4)

You can also use the index of the column and use the : operator to select a range of columns.
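A short sketch of selecting by position and by range, assuming the penguins data frame from palmerpenguins, whose first four columns are species, island, bill_length_mm, and bill_depth_mm.

```{r}
library(dplyr)
library(palmerpenguins)  # assumed source of the penguins data frame

# By position: the first three columns
first_three <- penguins |>
  select(1:3)

# By name range: every column from species through bill_depth_mm
by_range <- penguins |>
  select(species:bill_depth_mm)

names(first_three)
names(by_range)
```

Selecting by name is usually more robust than selecting by position, because positions change when columns are added or reordered.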
select() to keep or drop columns (4/4)

See ?select for more details. Once you know regular expressions (the topic of Chapter 15) you’ll also be able to use matches() to select variables that match a pattern.
You can rename variables as you select() them by using =. The new name appears on the left-hand side of the =, and the old variable appears on the right-hand side:
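A hedged sketch of both ideas, assuming the penguins data frame from palmerpenguins. The pattern "^bill_" and the new names `Species` and `mass_g` are illustrative.

```{r}
library(dplyr)
library(palmerpenguins)  # assumed source of the penguins data frame

# matches(): select columns whose names match a regular expression
bill_cols <- penguins |>
  select(matches("^bill_"))

# Rename while selecting: new_name = old_name
renamed <- penguins |>
  select(Species = species, mass_g = body_mass_g)

names(bill_cols)
names(renamed)
```

select() keeps only the columns it names. To rename columns while keeping all the others, rename() uses the same `new_name = old_name` syntax.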
relocate() to move columns

relocate() is used to move columns to a new position in the data frame.
By default, variables are moved to the first position; however, the .after or .before arguments can be used to specify a new position.
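A minimal sketch of both behaviours, assuming the penguins data frame from palmerpenguins. The columns moved are illustrative.

```{r}
library(dplyr)
library(palmerpenguins)  # assumed source of the penguins data frame

# Default: the named column moves to the front
year_first <- penguins |>
  relocate(year)

# .after: place sex immediately after species
sex_after_species <- penguins |>
  relocate(sex, .after = species)

names(year_first)
names(sex_after_species)
```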
The pipe |>

We can string together multiple dplyr functions with the pipe operator |>.
Let’s find the five largest penguins by body mass, convert the mass to kg, and create a table showing the species, island, and body mass. Finally, we can rename the variables to make the table more readable.
For exploratory data analysis, we may not save the table as a new object.
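The pipeline described above might look like the sketch below. It assumes the penguins data frame from palmerpenguins. The result is stored in `largest_five` only so it can be inspected afterwards, and the display names are illustrative.

```{r}
library(dplyr)
library(palmerpenguins)  # assumed source of the penguins data frame

largest_five <- penguins |>
  arrange(desc(body_mass_g)) |>                 # heaviest first
  slice_head(n = 5) |>                          # keep the top five rows
  mutate(body_mass_kg = body_mass_g / 1000) |>  # grams to kilograms
  select(Species = species, Island = island, `Body Mass (kg)` = body_mass_kg)

largest_five
```

arrange() followed by slice_head() always returns exactly five rows. slice_max() is an alternative, but by default it keeps ties and can return more.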
group_by() and summarize()

- group_by(): Group data by one or more variables.
- summarize(): Summarize data by collapsing each group into a single row.

group_by()

Group the penguins data frame by species and island.
Look carefully at the new data frame. What’s changed?
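The printed data frame gains a header line, “Groups: species, island [5]”, and the object gains the grouped_df class. The data themselves are unchanged.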
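A short sketch for inspecting what group_by() changes, assuming the penguins data frame from palmerpenguins. The name `penguins_grouped` is illustrative.

```{r}
library(dplyr)
library(palmerpenguins)  # assumed source of the penguins data frame

penguins_grouped <- penguins |>
  group_by(species, island)

class(penguins_grouped)       # includes "grouped_df"
group_vars(penguins_grouped)  # the grouping variables
n_groups(penguins_grouped)    # number of species-island combinations present

# The rows and columns are untouched; only the grouping metadata is new
dim(penguins_grouped)
```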
summarize() on grouped data

Summarize the grouped data frame by calculating the mean body_mass_g for each group.
group_by() and summarize() together

Group the penguins data frame by species and island and summarize the data by calculating the mean body_mass_g for each group.
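A minimal sketch of the combined step, assuming the penguins data frame from palmerpenguins. The `.groups = "drop"` argument is one option for returning an ungrouped result, and the name `mass_by_group` is illustrative.

```{r}
library(dplyr)
library(palmerpenguins)  # assumed source of the penguins data frame

mass_by_group <- penguins |>
  group_by(species, island) |>
  summarize(
    mean_body_mass_g = mean(body_mass_g, na.rm = TRUE),  # ignore missing masses
    .groups = "drop"                                     # return an ungrouped data frame
  )

mass_by_group
```

Each species-island combination collapses to one row, so the result has as many rows as there are groups.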
Ask a question about the penguins data frame and use dplyr functions to answer it.
Use the final output of your analysis to create a table and ggplot2 visualization. For your visualization, include a title, subtitle, axis labels, and a caption.
```{r}
# Question: What is the average body mass of penguins by Species and Island?
penguins_analysis <- penguins |>
  group_by(species, island) |>
  summarize(
    count = n(),
    mean_body_mass_g = mean(body_mass_g, na.rm = TRUE)
  ) |>
  ungroup() |>
  rename(
    Species = species,
    Island = island,
    Count = count,
    "Mean Body Mass (g)" = mean_body_mass_g
  )

penguins_analysis
```

```{r}
penguins_analysis |>
  ggplot(aes(x = Island, y = `Mean Body Mass (g)` / 1000, fill = Species)) +
  geom_col(position = position_dodge(preserve = "single")) +
  labs(
    title = "Average Body Mass of Penguins by Island",
    subtitle = "Data from the palmerpenguins package",
    x = "Island",
    y = "Mean Body Mass (kg)",
    caption = "Data source: palmerpenguins package"
  ) +
  theme_classic()
```

Applied Statistical Techniques